Handle incompatible records between the 1.3 and 1.4 release #120
cmeiklejohn wants to merge 15 commits into 1.4 from
Conversation
Suppress the display of RAM and related information if the node is known to be incompatible.
src/riak_control_session.erl
What's wrong with matching against #member_info{} in this case? It seems the main difference is that the new record has additional fields.
A capability might also help, allowing new nodes to send the original record to old nodes.
Can't match member_info because we can't guarantee a specific format for it. I've added a capability for handling legacy record formats, but we still don't have a solution for 1.4.0/1.4.1 compatibility.
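A hedged sketch of how such a legacy-format capability could be registered with riak_core. The capability name `member_info_version`, the version atoms, and the module are illustrative assumptions, not the identifiers used in the actual patch:

```erlang
%% Sketch only: register a riak_core capability so the cluster can
%% negotiate which record format a node is allowed to send.
-module(riak_control_capability_sketch).
-export([register_capability/0, record_version/0]).

register_capability() ->
    %% Preference-ordered list: prefer the newer v2 record, fall back
    %% to v1. The default (v1) applies when a peer is too old to
    %% negotiate capabilities at all.
    riak_core_capability:register({riak_control, member_info_version},
                                  [v2, v1], v1).

record_version() ->
    %% Returns the highest version that every node in the cluster
    %% supports, so senders can downgrade the record for old peers.
    riak_core_capability:get({riak_control, member_info_version}, v1).
```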
Does this mean that a mixed cluster where the Riak Control node is not updated first will cause the Riak Control node to continually go down because of max restarts?
I've added a capability that prevents 1.4.1 nodes from sending the newer record to 1.2.x and 1.3.x nodes; however, there's still a situation where a 1.4.0 node running Control asks a 1.4.1 node for information.
Handle race condition by triggering a badrpc exception, which has backwards compatibility with <= 1.3. Craft custom messages for 1.4.0 with the bad record format.
Capability negotiation itself has a race condition, so it can't be relied on to detect 1.4.0: a node that is actually running 1.3.x will momentarily look like a 1.4.0 node while negotiation completes.
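The badrpc conversion described in the commit above might look something like this. This is a sketch under assumed names and record fields, not the actual patch; the idea is to refuse any reply that doesn't match the local record definition and reuse the existing RPC-failure path:

```erlang
%% Sketch: treat any reply that is not a locally-known #member_info{}
%% as if the RPC itself had failed, which the <= 1.3 error-handling
%% path already copes with. The record fields are illustrative.
-record(member_info, {node, status, ring_pct, mem_total, mem_used}).

normalize_reply(Reply) when is_record(Reply, member_info) ->
    Reply;
normalize_reply(_IncompatibleOrError) ->
    %% Covers both genuine {badrpc, Reason} tuples and records of the
    %% wrong shape sent by a 1.4.0 node.
    {badrpc, incompatible_record}.
```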
I'm working on a riak_test to verify this patch (so we don't have to test this by hand in the future). I've verified the crash upgrading from 1.2.1 to a recent master of Riak; I pasted the observed error below. I figure 1.2.1 -> current exhibits the same issue as 1.3 or 1.4 -> current. Next I'll test the patch while upgrading from the 1.3/1.4 versions.
+1 to merge. The riak test I created fails without the patch and passes with it. So at least we know Riak Control should no longer crash a node in a mixed-cluster environment.
Rebased to #122.
Rather, rebased to #123. |
Some issues have been found when running a mixed cluster with Riak Control enabled. This tests a typical customer upgrade scenario and verifies that Riak Control at least doesn't crash any of the nodes while the cluster is in a mixed state. See basho/riak_control#120
I've confirmed a problem with a rolling upgrade between the 1.3 and 1.4 versions of Riak.
The root of the problem is that the node hosting Riak Control initiates an RPC request to all nodes and pattern matches the result against a record. The format of this record changed between the 1.3 and 1.4 releases without the record being renamed, so the pattern match fails.
This can occur when either:
1.) Riak Control hosted on a 1.3 node makes an RPC request to a 1.4 node.
2.) Riak Control hosted on a 1.4 node makes an RPC request to a 1.3 node.
Both of these scenarios trigger a badmatch exception in the riak_control_session module. This causes the Riak supervisor to restart the riak_control application, which then continues to crash when polling for the state until the node shuts down after hitting the maximum restart threshold.
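To illustrate the failure mode, here is a minimal sketch. The record fields and the remote function name are assumptions for illustration, not the actual riak_control definitions: records compile to tagged tuples, so two releases sharing a record name but differing in fields produce tuples of different arity, and the match raises badmatch.

```erlang
%% "1.3-era" shape of the record (fields are illustrative).
-record(member_info, {node, status, ring_pct}).

get_member_info(Node) ->
    %% A "1.4" node replies with {member_info, Node, Status, RingPct,
    %% MemTotal, MemUsed, ...} -- same tag, larger tuple -- so this
    %% match raises {badmatch, ...} and crashes riak_control_session.
    #member_info{} = rpc:call(Node, riak_control_session, get_my_info, []).
```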
This patch provides a few things:
member_info record, and the proposed 1.4.1 record the definitive member_info_v2 record. Updates
Rolling Upgrades
I would appreciate someone testing these upgrade scenarios as well to ensure total coverage.
I've left the commits broken out for the reviewer, but will rebase before merge.
@jonmeredith @russelldb @seancribbs @jgnewman @bsparrow435 @rzezeski